You only look once


Numerous technologies and systems, including autonomous vehicles, 
surveillance systems, and robotic applications, rely on the capability to 
accurately detect pedestrians to ensure their safety. As the demand for real-time
object detection continues to rise, many researchers have dedicated
their efforts to developing effective and trustworthy algorithms for 
pedestrian recognition. By integrating learning complexity-aware cascades
with an enhanced you only look once (YOLO) algorithm, this paper presents
a real-time system for detecting both objects and pedestrians. The
performance of the proposed approach is evaluated on the Karlsruhe
Institute of Technology and Toyota Technological Institute (KITTI)
pedestrian dataset across both the v4 and v8 versions of the YOLO
framework. Prioritizing both speed and accuracy, the enhanced YOLO
algorithm outperforms its baseline counterpart. The demonstrated superiority 
of the suggested technique on the KITTI pedestrian dataset underscores its 
effectiveness in real-world contexts. Furthermore, the complexity-aware 
learning cascades contribute to a streamlined detection model without 
compromising performance. When applied to scenarios requiring real-time 
identification of objects and individuals, the proposed method consistently 
delivers promising outcomes. 
1. INTRODUCTION


Utilizing computer vision for real-time pedestrian detection is crucial for safety in autonomous 
vehicles, surveillance, robotics, and automation. While you only look once (YOLO) models, particularly 
YOLOv3, have shown promise, challenges like occlusion and false positives persist. Adaptations, such as 
complexity-aware cascades and enhanced YOLOv2, aim to overcome these hurdles. Additionally, studies
propose combining YOLOv3 with cascaded region-based convolutional neural network (R-CNN) and 
employing Kalman filters for real-time object tracking. Recent algorithms like YOLOv3-tiny, YOLOv4, and 
YOLOv4-tiny have found applications in pedestrian, vehicle, and obstacle detection [1]. This paper 
contributes an enhanced YOLO iteration, focusing on small object identification and false positive reduction. 
Further research explores YOLO deployment for real-time object detection and tracking and the synergy of
YOLO with Kalman filters in low-light conditions [2]. While Liu’s work emphasizes YOLOv3 for pedestrian 
recognition in smart urban settings [3], the broader field of deep learning-based intelligent transportation 
systems primarily centers on vehicle detection. An emerging breakthrough is the fully convolutional one-stage 
(FCOS) 3D object detection method [4]. Figure 1 illustrates bounding boxes incorporating
predictive information about dimensions and spatial positioning.


Figure 1. Bounding boxes incorporating predictive information about dimensions and spatial
positioning


In various academic domains, the practice of comparing the contents of objects has become essential, 
especially in fields like computer vision and machine learning. Object detection, particularly in real-time 
scenarios, has gained significance, with the YOLO algorithm being a notable deep learning-based solution. 
The authors of [5] enhanced the YOLO model by introducing complexity-aware cascades for
efficient real-time object and pedestrian recognition. Their study compares the proposed model with leading
techniques such as Faster R-CNN and RetinaNet and reports competitive performance across publicly available
datasets, emphasizing the importance of object content comparison in model development and evaluation [6].
In the realm of computer vision, the concept of “pedestrian well-exposure” is crucial for the clarity of 
pedestrian recognition systems. Researchers, as demonstrated in [7], have introduced an innovative algorithm 
assessing pedestrians’ well-exposure in images using a deep neural network. This approach enhances 
pedestrian recognition precision by predicting exposure levels for each pixel as shown in Figure 2. Thorough 
testing on diverse datasets demonstrates the superiority of their system over existing techniques. Additionally, 
the use of log-normalized heat maps aids in visualizing these aspects. This research contributes insights into 
optimizing real-time pedestrian and object detection through effective exposure assessment. 
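
The text does not specify how the heat maps are log-normalized. As a minimal sketch, assuming the common convention of compressing non-negative values with log(1 + v) and dividing by the log of the peak, the scaling could look as follows. The function name `log_normalize` and the sample counts are hypothetical and not taken from [7].

```python
import math


def log_normalize(heatmap):
    """Scale non-negative values to [0, 1] using log(1 + v) / log(1 + peak).

    Assumed convention for illustration only; the cited work may differ.
    """
    peak = max(max(row) for row in heatmap)
    if peak <= 0:
        # An all-zero map stays all zero instead of dividing by zero.
        return [[0.0 for _ in row] for row in heatmap]
    denom = math.log1p(peak)
    return [[math.log1p(v) / denom for v in row] for row in heatmap]


# Hypothetical per-cell counts spanning several orders of magnitude.
counts = [[0, 1, 3], [9, 99, 999]]
normalized = log_normalize(counts)
for row in normalized:
    print([round(v, 3) for v in row])
```

The logarithm keeps low-count cells visible next to a few very dense cells, which is the usual motivation for this kind of scaling in visualizations.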


[Figure 2 panel labels: S x S grid on input; class probability map; final detections]


Figure 2. Sequential operation of YOLO learning chips within an aware cascade framework, highlighting 
their modules, and fundamental concept 
In a comprehensive study [8], a multifaceted framework for pedestrian recognition is introduced,
incorporating tasks such as evaluating well-exposure using deep learning to enhance pedestrian recognition. 
Considering environmental factors, especially “pedestrian well-exposure,” is crucial when developing computer 
vision algorithms for pedestrian identification. This consideration improves the precision of pedestrian detection 
for real-world applicability. Addressing the challenge of pedestrian recognition involves analyzing statistical
attributes inherent in datasets, exemplified by the prevalence of smaller objects in the Karlsruhe Institute of
Technology and Toyota Technological Institute (KITTI) dataset [9]. Figure 3 illustrates a heatmap indicating
the clustering tendency of child pedestrians, emphasizing the need for specific attention to small object detection,
especially in central image regions, for accurate identification of compact pedestrians [10].
Figure 3. Example of COCO dataset


2. RELATED WORK 

Numerous efforts in pedestrian detection involve manually crafted features, such as the “histogram 
of oriented gradient for person detection” descriptor [11], strengthened by integral channel features (ICF). 
The aggregate channel features (ACF) technique extends Haar features, histograms, and local sums [12], 
demonstrating efficacy in real-time facial recognition when fused with AdaBoost classifiers. The Fast R-CNN
method employs region proposals for object detection [13]. The KITTI vision benchmark suite dataset is
widely used for evaluating diverse object detection algorithms, including YOLO. The CityPersons dataset [14]
focuses on pedestrians in urban environments, inspiring innovations like the scale-aware Fast R-CNN. In
contemporary object detection, deep convolutional neural networks (CNNs) are prominent, addressing 
objects of varying sizes and aspect ratios [15]. Scene-adaptive people recognition systems, trained with target 
data [16], showcase remarkable effectiveness in pedestrian detection. While CNN methodologies have 
triumphed, the utility of manually constructed feature-based algorithms, like histogram of oriented gradients 
(HOG), remains evident. Strategies like quick feature pyramids and ICF enhance model performance on the 
INRIA pedestrian dataset [17]. For the PASCAL VOC dataset, [18] suggests adopting scale invariant CNNs 
to address objects with varying sizes and aspect ratios. Despite strides by deep CNNs, the pivotal roles of 
HOG and ICF persist in object detection. Recent breakthroughs and datasets like CityPersons and KITTI have
propelled object detection. Noteworthy is the rise of single-stage detectors prioritizing speed, exemplified by
SSD [19]. YOLOv2 [20], an enhanced version of YOLO, incorporates anchor boxes and k-means clustering of
box dimensions to improve training. EfficientDet proposes a compound scaling strategy for elevated object detection
performance, while RetinaNet introduces a focal loss function to mitigate class imbalance issues [21].
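
To make the focal loss mentioned above concrete, the following minimal sketch computes the binary form FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t) for a single prediction. The defaults alpha = 0.25 and gamma = 2.0 are the commonly cited RetinaNet settings and are assumptions here, not values reported in this paper; the function name is illustrative.

```python
import math


def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss for one prediction.

    p: predicted probability of the positive class (0..1)
    y: ground-truth label, 0 or 1
    alpha, gamma: assumed defaults, not taken from this paper
    """
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)           # avoid log(0)
    p_t = p if y == 1 else 1.0 - p            # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# An easy positive (p = 0.9) contributes far less than a hard one (p = 0.1).
easy = focal_loss(0.9, 1)
hard = focal_loss(0.1, 1)
print(f"easy positive: {easy:.6f}, hard positive: {hard:.6f}")
```

The factor (1 - p_t)^gamma shrinks the loss of well-classified examples, so the many easy background samples do not dominate training; with gamma = 0 the expression reduces to alpha-weighted cross-entropy.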


3. METHOD 

The common objects in context (COCO) dataset [22] plays a prominent role as a widely used 
benchmark in the domains of object detection, segmentation, and captioning tasks. Recognized as one of the 
most expansive and diverse datasets, COCO boasts a collection exceeding 330,000 images as illustrated in 
Figure 4 and encompasses more than 2.5 million annotated instances across a spectrum of 80 distinct item 
categories. The dataset’s popularity within the realm of computer vision research is owed to its challenging
nature, often presenting intricate scenes and instances of occlusion that test the limits of algorithms. The 
Caltech pedestrian dataset [23] stands as a definitive reference for evaluating pedestrian detection 
methodologies. Spanning a continuous ten-hour sequence captured at 640x480 resolution and a frame rate of 
30 frames per second, this dataset is meticulously annotated with bounding boxes encompassing both the entire 
scene and all individuals present. Comprising over 350,000 images and approximately 250,000 frames, its 
appeal arises from the intricate occlusion scenarios it covers and the realistic portrayal of pedestrian-congested 
environments. In a collaborative endeavor between the KITTI dataset has emerged as a pioneering benchmark 
within the field of autonomous driving research. Notably, this dataset incorporates a dedicated segment 
focused on pedestrian detection. With a collection surpassing 15,000 images captioned with individuals and 
7,481 images capturing urban landscapes from a moving vehicle, it is widely acknowledged that the dataset’s 
depiction of pedestrian scenarios amid natural barriers stands among the most authentic and valuable 
resources available. In our investigation of pedestrian detection methods, we initially utilized a pre-trained 
HOG detector from the OpenCV library, known for its efficiency [24]. The detector showed promising 
results in image-based detection but exhibited limitations in dynamic video scenes. To assess its performance 
in a video context, we developed a technique to extract frames and applied the HOG detector to each frame, 
creating a composite image to visualize detected pedestrians over time. Despite encouraging image-based 
results, the method faced challenges in adapting to video-based detection, particularly in real-time scenarios 
and dynamic contexts, revealing the need for more advanced object detection techniques. Addressing these
challenges is crucial for applications like autonomous vehicles, surveillance, and robotics, where
accurate identification is paramount in complex real-world scenarios. The goal was to leverage recent
advances in object detection technology to overcome the limitations posed by the HOG-based detection 
method, especially in scenarios involving moving objects, occlusions, and varying lighting conditions [25]. 

The YOLO network originated in the Darknet framework but was adapted for integration into 
Google Colab by transitioning to a Python-based neural network framework. This transformation involved 
converting Darknet into TensorFlow, resulting in a customized version called Darkflow, as shown in Figure 5.
The YOLOv2 network architecture was based on a chosen design and its weights were initialized using
pretrained weights from the original authors, trained on the COCO dataset, a comprehensive benchmark with 
a diverse array of object categories. 

Pretrained weights were integrated to provide the model with a learning head start, enhancing its 
performance for the specific study. Refinements to the output layer focused on a single class (individuals), and 
adjustments to the second-to-last convolutional layer optimized the number of filters for improved predictions. 
The COCO dataset was utilized for training, involving the dissection of videos into frames, resulting in a subset 
of 75,057 frames featuring pedestrians. Annotation data was converted to Darkflow-compatible XML format, 
and the dataset was split into training (67,552 frames) and test sets (7,505 frames). The SORT algorithm was 
implemented for real-time tracking, generating output files for video assembly. 
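
Two pieces of arithmetic in the paragraph above can be checked with a short sketch. It assumes the usual YOLOv2-style head, in which each anchor predicts four box offsets, one objectness score, and one score per class, together with five anchors (a common YOLOv2 default, not stated in this paper). The 10% held-out share is inferred from the reported frame counts rather than given explicitly. Function names are illustrative.

```python
def detection_head_filters(num_classes, num_anchors=5):
    """Filters for the last convolutional layer before a YOLOv2-style output.

    Each anchor predicts 4 box offsets + 1 objectness + num_classes scores.
    num_anchors=5 is an assumed default.
    """
    return num_anchors * (num_classes + 5)


def split_counts(total_frames, test_percent=10):
    """Return (train, test) sizes, rounding the test share down."""
    test = total_frames * test_percent // 100
    return total_frames - test, test


# Single-class (pedestrian) head versus the 80-class COCO head.
print(detection_head_filters(1))    # single class
print(detection_head_filters(80))   # COCO classes

# Frame counts reported in the text: 75,057 pedestrian frames in total.
train, test = split_counts(75057)
print(train, test)
```

Under these assumptions the split reproduces the reported 67,552 training and 7,505 test frames, and restricting the output to one class reduces the layer from 425 to 30 filters.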

The SA YOLOv4 system divides the input image into inner and outer halves for nearby and distant 
pedestrian detection. Both halves undergo identical convolutional layer processing, generating feature maps. 
Two compact neural networks are trained to discern pedestrians at various distances. Non-maximum 
suppression (NMS) merges sub-network outcomes, assigning confidence scores and bounding boxes to each 
recognized pedestrian. SA YOLOv4 harmonizes outputs, achieving comprehensive pedestrian detection 
across different scales. In the YOLOv4 framework, class prediction optimization during training uses the 
binary cross-entropy loss mechanism. CSPDarknet-53 architecture replaces earlier counterparts, enhancing 
efficiency by eliminating redundant processing. The integration of a spatial attention module (SAM) captures 
spatial interdependencies for more precise object detection outcomes. YOLOv4 combines bag of freebies 
(BOF) and bag of specials (BOS) strategies, making it a highly advanced system with innovative elements 
and enhancements that collectively amplify object detection precision and effectiveness. The SA YOLOv4 
architecture, an enhanced iteration of the YOLOv4 object detection model, was meticulously crafted to excel 
in identifying individuals of varying statures on urban streets. Guided by the scene’s geometry, an initial 
segmentation of the input image into three distinct sections is executed, with particular emphasis on 
pedestrians situated within the central portion depicted in Figure 6. Subsequently, both the entire image and 
its focal region undergo processing within the network. Through a process of stacked input convolution, the 
network generates feature maps. These derived feature maps are then fed into two distinct networks, each of 
which specializes in discerning pedestrians of different scales (ranging from larger to smaller dimensions). 
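
The binary cross-entropy loss used for class prediction can be written out in a few lines. This is a generic sketch of the standard formula, not the training code of this study; the function name, the clipping constant, and the sample values are illustrative.

```python
import math


def binary_cross_entropy(targets, probs, eps=1e-7):
    """Mean binary cross-entropy over paired labels and probabilities.

    targets: iterable of 0/1 labels
    probs: iterable of predicted probabilities for the positive class
    """
    total = 0.0
    count = 0
    for y, p in zip(targets, probs):
        p = min(max(p, eps), 1.0 - eps)  # clip to keep log() finite
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        count += 1
    return total / count


# Hypothetical labels for four candidate boxes (1 = pedestrian).
labels = [1, 0, 1, 0]
confident = binary_cross_entropy(labels, [0.9, 0.1, 0.8, 0.2])
uncertain = binary_cross_entropy(labels, [0.5, 0.5, 0.5, 0.5])
print(f"confident: {confident:.4f}, uncertain: {uncertain:.4f}")
```

Because each class is scored independently, this loss does not force class probabilities to sum to one, which is the usual reason for preferring it over a softmax in YOLO-style class prediction.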

The network processes both the entire image and its central portion, creating feature maps through 
convolutional layers. It then bifurcates into two sub-networks for pedestrian recognition across varying 
scales. Each sub-network extracts scale-specific attributes through convolutional layers, and the resulting 
feature vector undergoes fully connected layers, generating output vectors for classification scores and 
bounding box coordinates. NMS amalgamates outcomes for a unified detection result. 
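
The merging step can be illustrated with greedy non-maximum suppression over the pooled detections of both sub-networks. This is a minimal sketch of the standard algorithm; the IoU threshold of 0.5, the box format (x1, y1, x2, y2), and the sample detections are assumptions for illustration and are not taken from this paper.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_threshold=0.5):
    """Greedy NMS over (box, score) pairs, highest score first."""
    ordered = sorted(detections, key=lambda d: d[1], reverse=True)
    kept = []
    for box, score in ordered:
        # Keep a box only if it does not overlap strongly with a kept one.
        if all(iou(box, kept_box) <= iou_threshold for kept_box, _ in kept):
            kept.append((box, score))
    return kept


# Hypothetical outputs of the large-scale and small-scale sub-networks.
large_scale = [((100, 100, 200, 300), 0.90)]
small_scale = [((105, 110, 205, 310), 0.75), ((400, 150, 430, 220), 0.60)]

merged = nms(large_scale + small_scale)
for box, score in merged:
    print(box, score)
```

In this example the two overlapping boxes around the same pedestrian collapse into the higher-scoring one, while the distant small pedestrian is retained.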
Figure 4. Functional architecture of the pedestrian recognition


Figure 6. Geometrically-guided image segmentation with emphasis on pedestrian concentration


The YOLO model, with cross-view computational models, shows promise in early skin disorder 
detection and crack resistance prediction. Its application in clinical studies benefits from combining deep 
learning with imaging technologies. However, domain complexity limits its full utilization. Hybrid deep 
learning algorithms offer increased predictive accuracy for image edge smoothing, with accelerated training 
durations and competence across COCO datasets. Refining YOLO architectures is crucial as deep learning 
gains popularity, with innovative design components contributing to advancements like cross-validation and 
interpretable machines as shown in Figure 7. 
Figure 7. Steps for classifying an object and pedestrian for getting ensemble and holistic score


Representational techniques help address complexity in image processing by identifying significant 
regions for predictions. Directed backpropagation is a unique strategy, but limitations exist. The model’s 
applicability is limited by training data diversity and spatial resolution. The YOLO model uses direct 
mapping from inputs to outputs, focusing on neurons with positive contributions. Perceptual techniques and 
cross-validation can expand problem space coverage. Figure 8 shows a three-dimensional graph of the
interplay between the probability and distance variables in the process of decoding and reconstructing
encoded bars for object and pedestrian detection.
Figure 8. Three-dimensional graph illustrating how probability and distance interact during the decoding and
reconstruction of encoded bars for object and pedestrian detection


4. RESULTS AND DISCUSSION 

The system proposed for real-time object recognition in edge contexts, particularly on embedded
hardware platforms, achieved a processing time of 450 milliseconds. Aware cascade learning enhances
performance by leveraging knowledge from the source task. Addressing color bias, the system employs color 
transformation and edge segmentation. Integrating deep learning into remote sensing shows potential for 
performance improvement, extracting general features and progressively specializing for specific tasks. 
Transfer learning initializes weight processes using pre-trained weights from deep YOLO models. The 
YOLO algorithm involves training a source network on a COCO dataset and transferring knowledge to a new 
network for different tasks. Gamma correction and tone mapping enhance highlights and tonal appearance. 
RAW image capture results in dark, unsuitable images for computer vision applications as shown in Figure 9. 
The findings demonstrate YOLOv8’s capability to efficiently retrieve frames from various sources while
upholding YOLO’s exceptional object detection performance. On a GTX 1060 system, it achieved real-time
object recognition for image dimensions of 128x128 and 256x256 with a latency of less than 0.1 seconds.
Conversely, performance declined when processing larger images (416x416 or 608x608) on the same
hardware, resulting in delays of 1.4 seconds for YOLOv8 at 416x416 and 2.8 seconds at 608x608. In
contrast, YOLOv4 showcased a processing delay under 0.3 seconds for the 608x608 image size.
Object and pedestrian detection accuracy in relation to existing methods is provided in Table 1.
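
The latencies reported above can be converted to approximate frame rates, assuming each figure is the time to process a single frame (the text does not state this explicitly). The values quoted as "less than" or "under" are upper bounds on latency, so the corresponding frame rates are lower bounds. The function name and dictionary layout are illustrative.

```python
def frames_per_second(latency_seconds):
    """Frame rate implied by a per-frame latency, assuming no pipelining."""
    return 1.0 / latency_seconds


# Latencies as reported in the text (seconds per frame, assumed).
reported = {
    ("YOLOv8", "256x256"): 0.1,  # upper bound ("less than 0.1 seconds")
    ("YOLOv8", "416x416"): 1.4,
    ("YOLOv8", "608x608"): 2.8,
    ("YOLOv4", "608x608"): 0.3,  # upper bound ("under 0.3 seconds")
}

for (model, size), latency in reported.items():
    fps = frames_per_second(latency)
    print(f"{model} {size}: {latency:.1f} s/frame -> {fps:.2f} FPS")
```

Read this way, the small input sizes correspond to at least 10 frames per second, whereas the 416x416 and 608x608 YOLOv8 runs fall below one frame per second on the same hardware.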


Figure 9. Identification and prediction of pedestrians on the road across diverse instances
Table 1. Accuracy of object and pedestrian detection in comparison to contemporary methodologies

Article    Technique                                       Accuracy (%)
[20]       CNN                                             97.67
[21]       Recurrent neural network (RNN) and LSTM         98.27
Proposed   Learning complexity-aware cascades and YOLO     98.93


5. CONCLUSION 

The insights disclosed in the preceding sections unveil potential pathways to enhance the
performance of the pedestrian detection algorithm. To begin with, extending the training duration of the
network with a diverse range of learning rates could notably bolster its ability to identify individuals by
helping the optimizer move past local minima. Secondly, optimizing the balance between resolution and frame rate has
the potential to elevate real-time performance, with the added possibility of improving frame rates by lowering 
the resolution of the video material. Prior to the real-world deployment of such systems, it is crucial to conduct 
thorough testing in controlled environments involving real vehicles. Additionally, a promising avenue of 
exploration is the feasibility of integrating the proposed YOLOv4 as an auxiliary element for real-time
processing within existing object identification methodologies. These guidelines can offer valuable insights 
into determining the most suitable hardware configuration for intelligent video applications built upon the 
YOLO framework. By addressing these research objectives in the future, the resilience and practicality of the 
pedestrian recognition system can be significantly enhanced for real-world scenarios. This study sheds light 
on the limitations of the current YOLO algorithm in terms of reliable real-time pedestrian identification, 
while concurrently proposing remedies through strategies like real-world testing, integration of YOLOv4 for
real-time processing, and extended training with varied learning rates. By effectively addressing the 
deficiencies of the present YOLO algorithm, this research not only paves the way for selecting optimal 
hardware configurations but also contributes to the overall enhancement of accuracy and real-time 
performance in object detection software. 